Variant calling in single molecule sequencing using a convolutional neural network

ABSTRACT

Systems and methods for variant calling in single molecule sequencing from a genomic dataset using a convolutional deep neural network. The method includes: transforming properties of each of the variants into a multi-dimensional tensor; passing the multi-dimensional tensors through a trained convolutional deep neural network to predict categorical output variables, the convolutional deep neural network minimizing a cost function iterated over each variant, the convolutional deep neural network trained using a training genomic dataset including previously identified variants, the convolutional neural network including: a plurality of pooled convolutional layers and at least two fully-connected layers connected sequentially after the last of the pooled convolutional layers, the at least two fully-connected layers comprising a second fully-connected layer connected sequentially after a first fully-connected layer; and outputting the predicted categorical output variables. In some cases, the categorical output variables include an alternate base, zygosity, variant type, and length of an indel.

TECHNICAL FIELD

The following relates generally to artificial neural networks and morespecifically to a system and method for variant calling in singlemolecule sequencing using a convolutional neural network.

SEQUENCE LISTING

The following application contains a sequence listing submitted as atext file in ASCII format, the name of the ASCII text file beingPA02069NPascii_ST25, the date of creation of the ASCII text file beingOct. 25, 2019, and the size of the ASCII text file being 2 KB. Thecontent of the ASCII text is hereby incorporated by reference.

BACKGROUND

The accurate identification of DNA sequence variants is an important,but challenging task in genomics. It is particularly difficult forsingle molecule sequencing, which has a per-nucleotide error rate ofapproximately 5% to 15%. It is necessary to accurately and efficientlydetermine variants so that the genomic variants that underlie phenotypicdifferences and disease can be correctly detected. However, thesequences upon which variation can be examined is often limited.

SUMMARY

In an aspect, there is provided a computer-implemented method forvariant calling in single molecule sequencing from a genomic datasetusing a convolutional deep neural network, the method comprising:obtaining variants of the genomic dataset; transforming properties ofeach of the variants into a multi-dimensional tensor; passing themulti-dimensional tensors through a trained convolutional deep neuralnetwork to predict categorical output variables, the convolutional deepneural network minimizing a cost function iterated over each variant,the convolutional deep neural network trained using a training genomicdataset comprising previously identified variants, the convolutionalneural network comprising: an input layer; a plurality of pooledconvolutional layers connected sequentially after the input layer, eachpooled convolutional layer taking an input, applying at least oneconvolutional operation to the input, and applying a pooling operationto the output of the convolutional operation; and at least twofully-connected layers connected sequentially after the last of thepooled convolutional layers, the at least two fully-connected layerscomprising a second fully-connected layer connected sequentially after afirst fully-connected layer; and outputting the predicted categoricaloutput variables.

In a particular case, for each variant, the categorical output variablescomprise: a first categorical output variable comprising an alternatebase at a single-nucleotide polymorphism; a second categorical outputvariable comprising zygosity; a third categorical output variablecomprising type; and a fourth categorical output variable comprisinglength of an insertion or deletion of bases.

In another case, the first categorical output variable is selected froma group comprising adenine (A), cytosine (C), guanine (G), and thymine(T), the second categorical output variable is selected from a groupcomprising Homozygote and Heterozygote, the third categorical outputvariable is selected from a group comprising Reference, SNP, Insertion,and Deletion, and the fourth categorical output variable is selectedfrom a group comprising 0, 1, 2, 3, 4, and greater than 4.

In yet another case, the first categorical output variable is outputfrom the first fully-connected layer, and the second categorical outputvariable, the third categorical output variable, and the fourthcategorical output variable are outputted from the secondfully-connected layer.

In yet another case, the first fully-connected layer comprises aregression analysis using a sigmoid activation function, and the secondfully-connected layer comprises a softmax classification analysis.

In yet another case, the plurality of pooled convolutional layerscomprises exactly three pooled convolutional layers.

In yet another case, the multi-dimensional tensor comprises a firstdimension corresponding to a position of the variant in addition toflanking base pairs, a second dimension corresponding to a count of abase on a sequencing read, and a third dimension corresponding to anumber of techniques for counting bases.

In yet another case, the first dimension comprises sixteen flanking basepairs on both sides of the variant for a total dimension ofthirty-three, the second dimension comprises adenine (A), cytosine (C),guanine (G), and thymine (T) for a total dimension of four, and thethird dimension comprises four techniques for counting bases for a totaldimension of four.

In yet another case, the four techniques comprise a first technique forencoding a reference sequence and a number of reads supporting referencealleles, and the second technique encodes inserted sequences, the thirdtechnique encodes deleted base-pairs, and the fourth technique encodesalternative alleles.

In another aspect, there is provided a system for variant calling insingle molecule sequencing from a genomic dataset using a convolutionaldeep neural network, the system comprising one or more processors andone or more non-transitory computer storage media, the one or morenon-transitory computer storage media causing the one or more processorsto execute: an input module to obtain variants of the genomic datasetand transform properties of each of the variants into amulti-dimensional tensor; a machine learning module to pass themulti-dimensional tensors through a trained convolutional deep neuralnetwork to predict categorical output variables, the convolutional deepneural network minimizing a cost function iterated over each variant,the convolutional deep neural network trained using a training genomicdataset comprising previously identified variants, the convolutionalneural network comprising: an input layer; a plurality of pooledconvolutional layers connected sequentially after the input layer, eachpooled convolutional layer taking an input, applying at least oneconvolutional operation to the input, and applying a pooling operationto the output of the convolutional operation; and at least twofully-connected layers connected sequentially after the last of thepooled convolutional layers, the at least two fully-connected layerscomprising a second fully-connected layer connected sequentially after afirst fully-connected layer; and an output module to output thepredicted categorical output variables.

In a particular case, for each variant, the categorical output variablescomprise: a first categorical output variable comprising an alternatebase at a single-nucleotide polymorphism; a second categorical outputvariable comprising zygosity; a third categorical output variablecomprising type; and a fourth categorical output variable comprisinglength of an insertion or deletion of bases.

In another case, the first categorical output variable is selected froma group comprising adenine (A), cytosine (C), guanine (G), and thymine(T), the second categorical output variable is selected from a groupcomprising Homozygote and Heterozygote, the third categorical outputvariable is selected from a group comprising Reference, SNP, Insertion,and Deletion, and the fourth categorical output variable is selectedfrom a group comprising 0, 1, 2, 3, 4, and greater than 4.

In yet another case, the first categorical output variable is outputfrom the first fully-connected layer, and the second categorical outputvariable, the third categorical output variable, and the fourthcategorical output variable are outputted from the secondfully-connected layer.

In yet another case, the first fully-connected layer comprises aregression analysis using a sigmoid activation function, and the secondfully-connected layer comprises a softmax classification analysis.

In yet another case, the plurality of pooled convolutional layerscomprises exactly three pooled convolutional layers.

In yet another case, the multi-dimensional tensor comprises a firstdimension corresponding to a position of the variant in addition toflanking base pairs, a second dimension corresponding to a count of abase on a sequencing read, and a third dimension corresponding to anumber of techniques for counting bases.

In yet another case, the first dimension comprises sixteen flanking basepairs on both sides of the variant for a total dimension ofthirty-three, the second dimension comprises adenine (A), cytosine (C),guanine (G), and thymine (T) for a total dimension of four, and thethird dimension comprises four techniques for counting bases for a totaldimension of four.

In yet another case, the four techniques for counting bases comprise afirst technique for encoding a reference sequence and a number of readssupporting reference alleles, and the second technique encodes insertedsequences, the third technique encodes deleted base-pairs, and thefourth technique encodes alternative alleles.

In another aspect, there is provided a computer-implemented method forvalidating variant calling from a genomic dataset received from aseparate variant caller, the method comprising: obtaining variants ofthe genomic dataset; receiving candidate categorical output variables ofthe variants from the separate variant caller; transforming propertiesof each of the variants into a multi-dimensional tensor; passing themulti-dimensional tensors through a trained convolutional deep neuralnetwork to predict categorical output variables, the convolutional deepneural network minimizing a cost function iterated over each variant,the convolutional deep neural network trained using a training genomicdataset comprising categorical output variables of previously identifiedvariants, the convolutional neural network comprising: an input layer; aplurality of pooled convolutional layers connected sequentially afterthe input layer, each pooled convolutional layer taking an input,applying at least one convolutional operation to the input, and applyinga pooling operation to the output of the convolutional operation; and atleast two fully-connected layers connected sequentially after the lastof the pooled convolutional layers, the at least two fully-connectedlayers comprising a second fully-connected layer connected sequentiallyafter a first fully-connected layer; and determining an output conditionfor each variant, the output condition comprising a positive conditionif all categorical output variables for the variant match the candidateoutput variables for the variant, otherwise the output conditioncomprising a negative condition; outputting the output conditions.

In a particular case, the method further comprising determining andoutputting a quality score, the quality score determination comprisingdetermining a distance between a best prediction and a second-bestprediction of the trained convolutional deep neural network for eachvariant.

These and other aspects are contemplated and described herein. It willbe appreciated that the foregoing summary sets out representativeaspects of a system and method for training a residual neural networkand assists skilled readers in understanding the following detaileddescription.

DESCRIPTION OF THE DRAWINGS

A greater understanding of the embodiments will be had with reference tothe Figures, in which:

FIG. 1 is a schematic diagram of a system for variant calling in singlemolecule sequencing using a convolutional neural network, in accordancewith an embodiment;

FIG. 2 is a flow chart of a method for variant calling in singlemolecule sequencing using a convolutional neural network, in accordancewith an embodiment;

FIG. 3 is a diagram of an example neural network architecture for thesystem of FIG. 1 ;

FIG. 4 shows representations of three types of small variants, and anon-variant, in accordance with the system of FIG. 1 ;

FIG. 5A illustrates an example visualization of heterozygote SNP from Tto G at chromosome 11, in accordance with the system of FIG. 1 ;

FIG. 5B illustrates an example visualization of heterozygote deletion AAat chromosome 20, in accordance with the system of FIG. 1 ;

FIG. 5C illustrates an example visualization of heterozygote insertionATCCTTCCT at chromosome 1, in accordance with the system of FIG. 1 ;

FIG. 5D illustrates an example visualization of heterozygote deletion Gat chromosome 1, in accordance with the system of FIG. 1 ;

FIG. 6 illustrates an example Venn diagram showing the number of knownvariants undetected by different combinations of sequencingtechnologies, in accordance with the system of FIG. 1 ;

FIG. 7 is a flow chart of a method of discriminating genomic variants,in accordance with an embodiment;

FIG. 8 illustrates an example Venn diagram of a variant set called bythe three callers, in accordance with the system of FIG. 1 ; and

FIG. 9 illustrates an example of counting A, C, G, or T on thesequencing reads.

DETAILED DESCRIPTION

Embodiments will now be described with reference to the figures. Forsimplicity and clarity of illustration, where considered appropriate,reference numerals may be repeated among the figures to indicatecorresponding or analogous elements. In addition, numerous specificdetails are set forth in order to provide a thorough understanding ofthe embodiments described herein. However, it will be understood bythose of ordinary skill in the art that the embodiments described hereinmay be practiced without these specific details. In other instances,well-known methods, procedures and components have not been described indetail so as not to obscure the embodiments described herein. Also, thedescription is not to be considered as limiting the scope of theembodiments described herein.

Any module, unit, component, server, computer, terminal or deviceexemplified herein that executes instructions may include or otherwisehave access to computer readable media such as storage media, computerstorage media, or data storage devices (removable and/or non-removable)such as, for example, magnetic disks, optical disks, or tape. Computerstorage media may include volatile and non-volatile, removable andnon-removable media implemented in any method or technology for storageof information, such as computer readable instructions, data structures,program modules, or other data. Examples of computer storage mediainclude RAM, ROM, EEPROM, flash memory or other memory technology,CD-ROM, digital versatile disks (DVD) or other optical storage, magneticcassettes, magnetic tape, magnetic disk storage or other magneticstorage devices, or any other medium which can be used to store thedesired information and which can be accessed by an application, module,or both. Any such computer storage media may be part of the device oraccessible or connectable thereto. Any application or module hereindescribed may be implemented using computer readable/executableinstructions that may be stored or otherwise held by such computerreadable media.

A fundamental problem in genomics is finding nucleotide differences inan individual genome relative to a reference sequence, often referred toas variant calling. It is generally ideal to accurately and efficientlycall variants so that the genomic variants that underlie phenotypicdifferences and disease can be sufficiently detected. Different datacharacteristics may contribute to higher variant calling performance;for example, properties of the sequencing instrument, quality ofpreceding sequence aligners, and alignability of a genome reference.Variant calling pipelines may use these characteristics, however, mostof these pipelines are only able to be used for short read sequencing.

Single Molecule Sequencing (SMS) approaches generate sequencing readstwo orders of magnitude longer than standard short-read Illuminasequencing (10 kbp to 100 kbp instead of ˜100 bp), but generally theyalso contain 5%-15% sequencing errors rather than ˜1% for Illumina.Single nucleotide and small indel variant calling with SMS remainssubstantially challenging because various approaches for variant callingfail to handle such a high sequencing error rate, especially oneenriched for indel errors. Advantageously, embodiments of the presentdisclosure provide, at least, single nucleotide and small indel variantcalling with SMS that produce a substantially reduced error rate.

Artificial Neural Networks (ANNs) can be used for DNA variant detectionby applying ANNs to analyzing images of aligned reads around candidatevariants. While some approaches use image classifiers, doing so may besub-optimal for variant calling because valuable information that couldcontribute to higher accuracy are lost during the image transformation.

In the present embodiments, a convolutional deep neural network isprovided for variant calling with SMS reads. Advantageously, the presentembodiments can, for example, predict variant type (SNP or indel),zygosity, alternative allele, and indel length from aligned reads. Thepresent inventors performed an example experiment that verified that thepresent embodiments can call variants in multiple human genomes both atcommon variant sites and genome-wide show are as accurate as approacheson Illumina data but with longer sequencing, and substantially moreaccurate and computationally efficient as other approaches on PacBio andONT data.

While the present embodiments are generally described as targeting SMS,it is understood that the present embodiments can also be generallyapplicable to short read data.

FIG. 1 shows various physical and logical components of an embodiment ofa system for variant calling in single molecule sequencing using aconvolutional deep neural network 100. As shown, the system 100 has anumber of physical and logical components, including a centralprocessing unit (“CPU”) 102 (comprising one or more processors), randomaccess memory (“RAM”) 104, an input interface 106, an output interface108, a network interface 110, non-volatile storage 112, and a local bus114 enabling the CPU 102 to communicate with the other components. TheCPU 102 executes executable instructions for implementing theconvolutional deep neural network, including various modules, asdescribed below in greater detail. The CPU 102 can itself be, or can beused in conjunction with, a graphical processing unit (GPU). RAM 104provides relatively responsive volatile storage to the CPU 102. Theinput interface 106 enables an administrator or user to provide inputvia an input device, for example a keyboard and mouse. The inputinterface 106 can be used to receive genomic data. In other cases, thegenomic data can be located on the database 116 or received via thenetwork interface 110. The output interface 108 outputs information tooutput devices, for example, a display 160 and/or speakers. The networkinterface 110 permits communication with other systems, such as othercomputing devices and servers remotely located from the system 100, suchas for a typical cloud-based access model. Non-volatile storage 112stores the operating instructions as well as data used by the system.Additional stored data can be stored in a database 116. During operationof the system 100, the instructions and data may be retrieved from thenon-volatile storage 112 and placed in RAM 104 to facilitate execution.In this embodiment, the system 100 is run on a client side device andaccesses content located on a server over a network, such as theinternet. In further embodiments, the system 100 can be run on any othercomputing device; for example, a desktop computer, a laptop computer, asmartphone, a tablet computer, a server, a smartwatch, distributed orcloud computing device(s), or the like.

In an embodiment, the CPU 102 is configurable to execute an input module150, a machine learning module 152, an output module 154, and adiscriminator module 156. As described herein, the machine learningmodule 152 is able to build and use an embodiment of a deepconvolutional neural network architecture.

In an embodiment, the system 100 can model a number of different typesof genomic variant characteristics with categorical variables. In aparticular case, at least one of four categorical variables can be used,including:

-   -   A∈{A, C, G, T} is an alternate base at a single-nucleotide        polymorphism (SNP), or the reference base otherwise;    -   Z∈{Homozygote, Heterozygote} is a zygosity of a variant;    -   T∈{Reference, SNP, Insertion, Deletion} is a variant type; and    -   L∈{0, 1, 2, 3, 4, >4} is a length of an insertion or deletion of        bases (indel), where ‘>4’ represents a gap longer than 4 base        pairs (bp).

In some cases, all the A, Z, T, and L variables can be used. In furthercases, any suitable subset of the A, Z, T, and L variables can be used.It is also understood that other suitable options for the variables canbe used; for example L∈{0, 1, 2, 3, 4, 5, >5}. In the above embodiment,L∈{0, 1, 2, 3, 4, >4} was selected by the present inventors because,generally, Indels longer than 4 are rare (<1% among Indels of allsizes).

Training data is used to train the convolutional neural network, asdescribed herein. The training data includes variants previouslyidentified with values for all the A, Z, T, and L variables, or thesubset of the A, Z, T, and L variables identified above. In some cases,each variable of the training data can be represented by a vector (forexample, with a one-dimensional tensor) using one-hot or probabilityencoding. Whereby: a_(b)=Pr{A=b}, z_(i)=δ(i, Z), t_(j)=δ(j, T) andl_(k)=δ(k, L), where δ(p, q) equals 1 if p=q, or 0 otherwise. The fourvectors (a, z, t, l) are the outputs of the convolutional neuralnetwork. In some cases, a_(b) is set to all zero for an insertion ordeletion. In some cases, multi-allelic SNPs are excluded from training,and base-quality is not used.

In the present embodiment, the system 100 seeks a function F: x→(a, z,t, l) that minimizes the cost C:

$C = {\frac{1}{N}{\sum\limits_{v}\left( {{\sum\limits_{b = 1}^{4}\left( {{\overset{\hat{}}{a}}_{b}^{(v)} - a_{b}^{(v)}} \right)^{2}} - {\sum\limits_{i = 1}^{2}{z_{i}^{(v)}\log{\overset{\hat{}}{z}}_{i}^{(v)}}} - {\sum\limits_{j = 1}^{4}{t_{j}^{(v)}\log{\overset{\hat{}}{t}}_{j}^{(v)}}} - {\sum\limits_{k = 1}^{6}{l_{k}^{(v)}\log{\overset{\hat{}}{l}}_{k}^{(v)}}}} \right)}}$where v iterates through all variants and a variable with a caretindicates it is an estimate from the neural network. Variable x is theinput of the network, and it can be of any suitable shape and containany suitable information. This embodiment uses an x that contains asummarized “piled-up” representation of read alignments, as describedherein.

Turning to FIG. 3 , a diagrammatic example of the convolutional neuralnetwork, according to the present embodiments, is shown. In FIG. 3 , thelabels under each layer include 1) the layer's function; 2) theactivation function used; 3) the dimension of the layer in parenthesis(Input layer: Height×Width×Arrays, Convolution layer:Height×Width×Filters, Fully connected layer: Nodes); and 4) kernel sizein brackets (Height×Width). In this example, the convolutional neuralnetwork is a multi-task five-layer convolution neural network with thelast two layers as feedforward layers. In this example, the neuralnetwork generates four groups of predictions on each input: 1)alternative alleles, 2) zygosity, 3) variant type, and 4) indel length.In this case, the predictions in groups 2, 3 and 4 are mutuallyexclusive while the predictions in group 1 are not. In this example, thealternative allele predictions can be determined directly from the firstfully connected layer (FC4), while the other three group of predictionscan be determined from the second fully connected layer (FC5). The indellength prediction group has six possible outputs indicating an indelwith a length between 0 to 3 bp or greater than 4 bp of any unboundedlength. In many cases, the prediction limit on indel length can beconfigurable and can be raised when more training data on longer indelsis provided. Advantageously, the convolutional neural network as shownhas been experimented with by the present inventors and found to besuccinct and fine-tuned for the variant calling purpose. It containsonly 1,631,496 parameters, about 13-times fewer than other approaches,for example, the DeepVariant approach using the Inception-v3 networkarchitecture.

Through extensive and arduously laborious testing, involving at leasthundreds of iterations of CNN architectures and combinations ofhyperparameters, the present inventors arrived at the above CNNarchitecture exemplified in FIG. 3 . This architecture was determined tobest capture genomic features from sequence alignment for variantcalling, as described herein. For example, this architecture achievedthe best F1-score (the harmonic mean of precision and recall) onnumerous test cases (those variants kept for performance testing and nottraining). During such testing, the CNN architectures were determinedwith hyperparameters having some general directions, for example, usingthree convolutional layers to reduce the dimension of input. Accordingto the testing, three layers were approximately enough to generateabstracted features of the read alignments in input. According to thetesting, this quantity of layers was able to achieve a good performancewhile also running sufficiently fast. The two fully connect layers wereused to perform a non-linear regression on the abstracted features. Inthe testing, the number of nodes in the two fully connected layers wereset to maximize the AUC (area under curve) on the test cases while notsignificantly increasing the computation time.

Advantageously, the multi-task architecture of the present embodimentsis used to solve variant calling in just a single model, which was asubstantial technical challenge in bioinformatics. As described herein,such architecture significantly increased the computational speed by,for example, using just a single model for all tasks. The multi-taskarchitecture also increased the performance of the CNN architecturegenerally by lowering the chance of overfitting to a single task.Additionally, the present inventors identified the technical advantagesof having the output variable A output by the first fully-connectedlayer, for example, from recognizing the diploid nature of humangenomics whereby only two different bases or just a single base ispossible; while the other three output variables, Z, T and L, can beoutput by the second fully-connected layer.

As shown in FIG. 3 , in an embodiment, for each input sample,overlapping sequencing read alignments can be transformed into amulti-dimensional tensor x. In a particular case, the multi-dimensionaltensor x can have a shape of 33 by 4 by 4. The first dimension ‘33’corresponds to the position of the variant. The second dimension ‘4’corresponds to the count of A, C, G, or T on the sequencing reads, wherethe method of counting is subject to the third dimension. The thirddimension ‘4’ corresponds to four different ways of counting. In thisembodiment, for the first dimension, 16 flanking base-pairs were addedto both sides of a candidate (for a total of 33 bp). The presentinventors measured this dimension to be sufficient to manifestbackground noise while providing good computational efficiency. For thesecond dimension, counts were separated into four bases. For the thirddimension, four different ways of counting were used, generating fourtensors of shape 33 by 4. The first tensor encodes the referencesequence and the number of reads supporting the reference alleles. Thesecond, third and fourth tensors use the relative count against thefirst tensor: the second tensor encodes the inserted sequences, thethird tensor encodes the deleted base-pairs, and the fourth tensorencodes alternative alleles. FIG. 4 illustrates an example of how thetensors can represent a SNP, an insertion, a deletion, and a non-variant(reference), respectively. The non-variant in FIG. 4 also depicts anexample of background noise. Advantageously, the convolutional neuralnetwork of the present embodiments decouples a substitution andinsertion signal into separate arrays and allows for the preciserecording of an allele of an inserted sequence.

In the above case, flanking bases can be used to provide localizedbackground noise of the target variant. Bases to the right of a targetvariant can also be used to provide a “length” and “the exact insertedor deleted bases” if the variant is an Indel.

Additionally, the present architecture separates the count of bases (A,C, G, T) in the input samples from the ways of counting a single, 1)reference, 2) a deletion, 3) an insertion, and 4) a SNP, such that theyare in separate channels. Advantageously, this can facilitate the CNNarchitecture to better make use of the count of different singles; forexample, the network does not need to tease out the singles of differentvariant types itself.

Turning to FIG. 9 , an example of counting A, C, G, or T on thesequencing reads is illustrated. The first row is a reference DNAsequence and the strings in the middle are sequencing reads. Generally,the starting position of a sequencing read is its alignment and acollection of alignments is a layout. For each position in the referencesequence, the layout provides the count of bases A, C, G, T of thesequencing reads at that position. The underlined bases suggest avariant at a reference position because the base with majority count isdifferent from the reference base. The bases that are highlighted butnot underlined are possible sequencing errors.

In an example for illustration purposes, the following pseudocode can beused to generate the input sample:

Input: p, genome position of a candidate variant, 1 -based r, referencegenome a, a set of elements representing the read alignments covering pat bp level, each element contains four members: 1) the matchedreference position (refpos), 1-based, 2) the offset of this bp withinits insertion pattern (insoffset), 1- based (only applicable if this bpis an inserted base), 3) the bp ‘A’ or ‘C’ or ‘G’ or ‘T’ or the gap ‘—’in r at refpos (refbase), 4) the bp or the gap in the correspondingaligned read at readpos (readbase). Output: A matrix x of order 33 by 4by 4 h.insert(‘A’, 0) h.insert(‘C’, 1) h.insert(‘G’, 2) h.insert(‘T’, 3)create a zero-filled matrix x of size (33 x 4 x 4) for each item in ado: /* Skip the Ns and IUPAC nucleotides in reference */ ifa[item].refbase not one of “ACGT-” then: continue /* Skip the Ns in read*/ if a[item].readbase not one of “ACGT-” then: continue /* Center thecandidate variant in the first dimension (33) of x */ offset <− refpos −p + 16 /* Skip if beyond scope */ if offset < 0 or offset > 32 then:continue /* Translate refbase and readbase into index */ refbaselndex <−h.get(a[item].refbase) readbaseindex <− h.get(a[item].readbase) /* Ifnot a deleted base */ if a[item].readbase ≠ ‘—’ then: /* If not ainserted base either, i.e. a ref or a SNP */ if a[item].refbase ≠ ‘—’then: increase x[offset][refbaseIndex][0] by 1 increasex[offset][readbaseIndex][1] by 1 increase x[offset][refbaseIndex][2] by1 increase x[offset][readbaseIndex][3] by 1 /* If is a inserted base */if a[item].refbase = ‘—’ then: newOffset <− min(offset +a[item].insoffset, 32) increase x[newOffset][readbaseIndex][1] by 1 /*If is a deleted base */ if a[item].readbase = ‘—’ then: increasex[offset][refbaseIndex][2] by 1 for i = 0 to 32 do: for j = 0 to 3 do:x[i][j][1] <− x[i][j][0] x[i][j][2] <− x[i][j][2] − x[i][j][0]x[i][j][3] <− x[i][j][3] − x[i][j][0] return x

FIG. 4 shows selected illustrations of how the present embodimentsrepresent three common types of small variants, and a non-variant. Eachvariant is presented with counts A, C, G, or T, top to bottomrespectively. A C>G SNP in the top-left illustration, a 9 bp insertionin the top-right illustration, a 4 bp deletion in the bottom-leftillustration, and a non-variant with reference allele in the in thebottom-right illustration. The intensity represents the strength of acertain variant signal. The SNP, insertion and deletion examples areideal with almost zero background noise. The non-variant exampleillustrates how the background noises look like when not mingled withany variant signal.

During training of the convolutional neural network, the system 100 canperform weight initialization to stabilize the variances of activationand back-propagated gradients at the beginning of model training. Insome cases, a He initializer can be used to initialize the weights ofhidden layers in the convolutional neural network because the Heinitializer is generally optimized for training extremely deep modelsusing rectified activation functions directly from scratch. In thesecases, for each layer, the weight of each node can be sampled from aunivariate normal distribution with σ=1÷√{square root over (d_(i)±2)},where d_(i) denotes the number of in-degree of the node. In furthercases, other suitable weight initialization approaches can be used.

In some embodiments, the system 100 can use batch normalization toensure zero mean and unit variance in each hidden layer of theconvolutional neural network; such that exploding or diminishinggradients can be avoided during training. In some cases, batchnormalization can be a computational bottleneck in neural networktraining because computing the mean and the standard deviation of alayer is not only a dependent step, but also a reduction step thatcannot be efficiently parallelized. Accordingly, in some embodiments,the system 100 can use “Scaled Exponential Linear Units” (SELUs), whichis a variant of a rectified activation function, as an activationfunction. Different from a standard batch normalization approach thatadds an implicit layer for the named purpose after each hidden layer,SELUs utilize a Banach fixed-point theorem to ensure convergence to zeromean and unit variance in each hidden layer without batch normalization.

In some embodiments, the system 100 can use an Adam optimizer to updatethe weights of the convolutional neural network by adaptivenode-specific learning rates. In this case, setting a global learningrate functions as setting an upper limit to the learning rates. Thisapproach advantageously allows the system 100 to remain at a higherlearning rate for a longer time to speed up the training process. Insome cases, while the Adam optimizer performs learning rate decayintrinsically, the system 100 can decrease the global learning rate whenthe cost of the model in training plateaued can lead to a better modelperformance. For example, in some cases, there may be two types oftraining modes. A fast training mode can use an adaptive decay methodthat uses an initial learning rate at, for example, 1e⁻³ and decreasesthe learning rate by a factor of, for example, 0.1 when the validationrate goes up and down for five rounds and stops after two times ofdecay. A nonstop training mode can receive a user's input as to when tostop and continue using a lower learning rate.

In some embodiments, even though there may be many labeled truthvariants in the training set, the scarcity of some labels, for examplevariants with a long indel length, could fail the model training byoverfitting to abundantly labeled data. To address the class imbalance,dropout and/or L2 regularization can be applied by the system 100.During training, dropout can be used to randomly ignore nodes in a layerwith probability p, then sum up the activations of remaining nodes, andfinally magnify the sum by 1/p. Then during testing, the activations ofall nodes can be summed with no dropout. With probability p, the dropouttechnique is effectively creating up to 1÷(1−p)^(n) possible subnetworksduring the training. Therefore, dropout can be seen as dividing anetwork into subnetworks with reused nodes during training. However, insome cases, for a layer with just enough nodes available, applyingdropout may require more nodes to be added, thus potentially increasingthe time needed to train a model. In an embodiment, dropout can beapplied only to the first fully connected layer (FC4) with p=0.5, and L2regularization can be applied to the hidden layers of the convolutionalneural network. In some cases, the system 100 can set the lambda of L2regularization to be the same as the learning rate.

Turning to FIG. 2 , a flowchart for a method for variant calling insingle molecule sequencing using a convolutional deep neural network200.

At block 202, the input module 150 obtains genomic data, for example,from the network interface 110, the input interface 106, or the database116.

At block 204, the input module 150 extracts overlapping sequencing readalignments from the genomic data and determines variants where the readalignments differ. In further embodiments, the input module 150 obtainsonly the extracted variant data, for example, in Variant Call Format(VCF).

At block 206, the input module 150 transforms the overlapping sequencingread alignments and associated variants into a multi-dimensional tensor.In a particular case, a shape of the multi-dimensional tensor can havethree dimensions; a first dimension corresponding to a position of avariant, a second dimension corresponding to a count of a base (A, C, G,or T) on a sequencing read, and a third dimension corresponding todifferent ways of counting bases. In some cases, the shape of themulti-dimensional tensor can be 33 by 4 by 4 for the first dimension,second dimension, and third dimension, respectively. In these cases, thefirst dimension representing a total comprising a candidate variant andflanking base-pairs added to both sides of the candidate.

At block 208, the machine learning module 152 passes themulti-dimensional tensor through a trained machine learning architectureto generate categorical output variables by minimizing a cost functioniterated over each variant. The categorical output variables can includeat least one of: alternate base at a single-nucleotide polymorphism,zygosity of a variant, variant type, and length of an indel. In somecases, these output variables can be represented by a vector (forexample, with a one-dimensional tensor) using one-hot or probabilityencoding. The machine learning architecture trained using a genomicdataset with previously identified variants (i.e., truth variants). Inan embodiment, for the training dataset, each truth variant was pairedwith two non-variants randomly sampled from the genome at all possiblenon-variant and non-ambiguous sites. In an embodiment, a minority (forexample, 10%) of the training dataset was used instead for validation ofthe machine learning model.

In an embodiment, the machine learning architecture includes aconvolutional neural network comprising an input layer feedingsequential convolutional and pooling layers, which feed at least oneregression and classification layer. In a particular case, the machinelearning architecture includes three convolutional layers, each followedby pooling layers, and at least two fully-connected layers. In aparticular case, each convolutional layer using an SELU activationfunction. In this case, the prediction for alternate base at asingle-nucleotide polymorphism is predicted from the firstfully-connected layer and the predictions for zygosity of the variant,variant type, and length of the indel are predicted from the secondfully-connected layer. In some cases, the fully-connected layers can usebatch normalization, such as SELU activation functions or sigmoidactivation functions, and dropout techniques.

At block 210, the output module 154 outputs the categorical outputvariables of the trained machine learning architecture.

In some cases, visualizations of the output of the system 100 can begenerated. For example, an interactive python notebook accessible withina web browser or a command line script for visualizing inputs and theircorresponding node activations in hidden layers and output layers. In anexample, the visualization can show the input and node activations inall hidden layers and output layers of an A>G SNP variant in sampleHG002 test against a model trained with samples from HG001 for athousand epochs at 1e⁻³ learning rate. Each of the nodes can beconsidered as a feature deduced through a chain of nonlineartransformations of the read alignments input.

Advantageously, embodiments of the present disclosure can besubstantially computationally efficient such that they can run on moderndesktop and server computers with commodity configurations. In somecases, implementations of the system 100 can be roughly divided into twogroups, one is sample preparation (preprocessing and model training),and the second is sample evaluation (model evaluation andvisualization). Model training runs efficiently because it can invokeoptimized techniques; for example, Tensorflow, which is maintained by alarge developer community and has been intensively optimized. In somecases, sample preprocessing was determined to be more efficient if usingspecialized compilers; for example, by using Pypy36, a Just-In-Time(JIT) compiler that performs as an alternative to the native pythoninterpreter. In example experiments, Pypy sped up sample preparation by5 to 10 times.

Generally, memory consumption in model training can be a concern. Forexample, with a naïve encoding, HG001 requires 40 GB memory to store thevariant and non-variant samples, which could prevent effective GPUutilization. The present inventors observed that, in some cases, thesesamples are immutable and follow the “write once, read many” accesspattern. Thus, in some cases, the system 100 can apply in-memorycompression; for example, using the blosc37 library with the lz4hccompression algorithm, which provides a high compression ratio, 100 MB/scompression rate, and an ultra-fast decompression rate at 7 GB/s.Example experiments conducted by the present inventors show thatapplying in-memory compression does not impact the speed but decreasedthe memory consumption by five times.

The present inventors conducted an example experiment to verify the useof the present embodiments. As part of the example experiment, thepresent embodiments were benchmarked on DNA sequencing datasets usingthree different example sequencing technologies: Illumina, PacBio, andONT. Illumina is an example “next-generation sequencing” (NGS)technology and PacBio and ONT are example “single molecule sequencing”(SMS) technologies. SMS generally produces much longer sequencing readsand generally easier in terms of sample preparation. However, for SMS,the error rate is presently very high at 10-15% in comparison to NGS,which is less than 1%. Advantageously, the present embodiments canprovide a suitable error rate to do variant calling using SMS.

The example experiments use the Genome-in-a-Bottle (GIAB) dataset. GIABis a project that produces sequencing data of all available sequencingtechnologies of a few standard samples with known variants. The GIABdataset provides high-confidence SNPs and indels for a standardreference sample HG001 (also referred to as NA12878) by integrating andarbitrating between 14 datasets from five sequencing and genotypingtechnologies, seven read mappers and three variant callers. For theexample experiments, the training dataset (also referred to as the truthdataset) used dataset version 3.3.2 for HG001 that comprises 3,042,789SNPs, 241,176 insertions and 247,178 deletions for the GRCh38 referencegenome, along with 3,209,315 SNPs, 225,097 insertions and 245,552deletions for GRCh37. The dataset also provides a list of regions thatcover 83.8% and 90.8% of the GRCh38 and the GRCh37 reference genome,where variants were confidently genotyped. The GIAB extensive projectwas also used that further introduced four standard samples, includingthe Ashkenazim Jewish sample HG002, containing 3,077,510 SNPs, 249,492insertions and 256,137 deletions for GRCh37, 3,098,941 SNPs, 239,707insertions and 257,019 deletions for GRCh38. 83.2% of the whole genomewas marked as confident for both the GRCh37 and GRCh38.

The example experiments use the Illumina technology produced by theNational Institute of Standards and Technology (NIST) and Illumina. Boththe HG001 and HG002 datasets were generated on an Illumina HiSeq 2500 inRapid Mode (v1) with 2×148 bp paired-end reads. Both have approximately300× total coverage and were aligned to GRCh38 decoy version 1 usingNovoalign version 3.02.07. In the example experiments, the two datasetswere further down-sampled to 50× to match the available data coverage ofthe other two SMS.

The example experiments use the Pacific Bioscience (PacBio) technologythat was produced by NIST and Mt. Sinai School of Medicine. The HG001dataset has 44× coverage, and the HG002 has 69×. Both datasets comprise90% P6-C4 and 10% P5-C3 sequencing chemistry and have a sequence N50length between 10 k-11 kbp. Reads were extracted from the downloadedalignments and aligned again to GRCh37 decoy version 5 using NGMLRversion 0.2.3.

The example experiments use the Oxford Nanopore (ONT) technology thatwas generated by the Nanopore WGS consortium. Only data for sample HG001are available to date, thus limiting the “cross sample variant callingevaluation” and “combined sampled training” on ONT data. The exampleexperiments used the ‘rel3’ release sequenced on the Oxford NanoporeMinION using 1D ligation kits (450 bp/s) and R9.4 chemistry. The releasecomprises 39 flowcells and 91.2 G bases, about 30× coverage. The readswere downloaded in raw fastq formatted and aligned to GRCh37 decoyversion 5 using NGMLR version 0.2.3.

In the example experiments, ‘good performance’ implies correctpredictions could be made even when the evidence is marginal todistinguish a genuine variant from non-variant (reference) position. Toachieve this goal, the present inventors paired each truth variant withtwo non-variants randomly sampled from the genome at all possiblenon-variant and non-ambiguous sites for model training. With about 3.5 Mtruth variants from the GIAB dataset, about 7 M non-variants are addedas samples for training of the CNN model. In this case, pairing twonon-variants with a variant was used to teach the CNN to prefer anon-variant decision when the odds of making either decisions are even.

For the example experiments, the present inventors randomly partitionedall samples into 90% for training and 10% for validation. In this case,different datasets were used for testing samples. For example, in theexample experiments, the samples of HG002 were used to test against amodel trained on HG001, and vice versa.

The benchmarking was conducted at common variant sites from 1000 GenomesProject phase 3 with minor allele frequency ≥5%. The system's 100performance to call variants genome-wide was examined. In addition,other variants caller approaches were examined for comparison. Testswere conducted to determine: 1) what are the characteristics of falsepositives and false negatives; 2) can lower learning rate and longertraining provide better performance; 3) can a model train on truthvariants from multiple samples provide better performance; 4) can ahigher input data quality improve the variant calling performance; and5) is the current network design sufficient in terms of learningcapacity.

For the example experiments, GPU acceleration was used for modeltraining and CPU-only was used for variant calling. Using ahigh-performance desktop GPU model GTX 1080 Ti, 170 seconds are neededper epoch, which leads to about 5 hours to finish training a model withthe fast training mode. However, for variant calling the speed up by GPUis insignificant because CPU workloads such as VCF file formatting andI/O operations dominate. Variant calling at 8.5 M common variant sitestakes about 40 minutes using 28 CPU cores. Variant calling genome-widevaries between 30 minutes to a few hours subject to which sequencingtechnology and alternative allele frequency cutoff were used.

In the example experiments, the present embodiments were benchmarked onthree sequencing technologies: Illumina, PacBio, and ONT using both thefast and the nonstop training mode. In nonstop training mode, trainingthe model was started from 0 to 999-epoch at learning rate 1e⁻³, then to1499-epoch at 1e⁻⁴, and finally to 1999-epoch at 1e⁻⁵. The modelgenerated by the fast mode was then benchmarked, and all three modelsstopped at different learning rates in the nonstop mode. Variant callingwas also benchmarked on one sample (e.g., HG001) using a model trainedon another sample (e.g., HG002). Further, a GATK UnifiedGenotyper andGATK HaplotypeCaller were benchmarked for comparison. Noteworthy, GATKUnifiedGenotyper was superseded by GATK HaplotypeCaller, thus forIllumina data, one should refer to the results of HaplotypeCaller as thetrue performance of GATK. However, the experimental benchmarks show thatUnifiedGenotyper performed better than HaplotypeCaller on the PacBio andONT data, thus UnifiedGenotyper was also benchmarked for all threetechnologies to make parallel comparisons. A benchmark was alsodetermined for SMS reads using the Nanopolish v0.9.0 tool.

In the example experiments, as described herein, benchmarks at commonvariant sites from 1000 Genomes Project phase 3 with global MAF (minorallele frequency) ≥5% (8,511,819 sites for GRCh37, 8,493,490 sites forGRCh38, hereafter referred to as “1KGp3”) demonstrate the performance ofthe present embodiments on a typical precision medicine application,where tens to hundreds of known clinically relevant or actionablevariants are being genotyped. This is becoming increasingly important asSMS is becoming more widely used for clinical diagnosis of structuralvariations, but at the same time, doctors and researchers also want toknow if there exist any actionable or incidental small variants withoutadditional short read sequencing. So firstly, the present embodimentswere evaluated on common variant sites before extending the evaluationgenome-wide.

In the example experiments, vcfeval in RTG Tools version 3.7 was used tobenchmark the results and generate five metrics including FPR (FalsePositive Rate), FNR (False Negative Rate), Precision, Recall, andF1-score. From the number of true positives (TP), false positives (FP),true negatives (TN), and false negatives (FN), the five metrics weredetermined as FPR=FP÷(FP+TN), FNR=FN≥(FN+TP), Precision=TP≥(TP+FP),Recall=TP≥(TP+FN), and

${F\; 1\text{-}{score}} = {\frac{2TP}{{2TP} + {FN} + {FP}}.}$TP was defined as variants existing in both 1KGp3 and GIAB dataset thatidentified as a variant with no discrepancy in terms of allele, type,zygosity and indel length if applicable. TN was defined as variantsexisting in 1KGp3 but not in the GIAB dataset that identified as anon-variant. FP was defined as 1) sites supposed to be TN but identifiedas variant, or 2) variants existing in the GIAB dataset that alsoidentified as a variant, but with discrepant variant type, alternativeallele or zygosity. FN was defined as the variants existing in the GIABdataset but identified as a non-variant. F1-score is the harmonic meanof the precision and recall. RTG vcfeval also provides the best variantquality cutoff for each dataset, filtering the variants under which canmaximize the F1-score.

TABLE 1 shows example results for the performance of the presentembodiments in the example experiment on Illumina data. In thisexperiment, the best accuracy was shown to be achieved by callingvariants in HG001 using the model trained on HG001 at 999-epoch, with0.25% FPR, 0.41% FNR, 99.75% precision and 99.67% F1-score. Generally, amajor concern of using deep learning or any statistical learningtechnique for variant calling is the potential for overfitting to thetraining samples. The present results show that the present embodimentswere not affected by overfitting, and the versatility of the trainedmodels were validated by calling variants in a genome using a modeltrained on a second sample. The gap of FNR between the presentembodiments and GATK UnifiedGenotyper on HG001 is 0.68% (3.11% against2.43%) but enlarged to 3.27% (5.80% against 2.52%) on HG002.

TABLE 1 Over- Over- Ending Call Best Over- Over- all all SNP SNP IndelIndel Model Learning Vari- Variant all all Preci- F1 SNP SNP Preci- F1Indel Indel Preci- F1 Seq. Trained Trained Rate and ants Quality FPR FNRsion Score FPR FNR sion Score FPR FNR sion Score Tech. on Epochs Lambdain Cutoff % % % % % % % % % % % % Illumina HG001   67* 1.E−05 HG001 670.28 0.45 99.72 99.64   .07  .10 99.9 99.9 1.9 3.3 98.0 97.3  999 1.E−03119 0.25 0.41 99.75 99.67   .08  .08 99.9 99.9 1.6 3.1 98.3 97.5 14991.E−04 128 0.28 0.41 99.72 99.66   .08  .08 99.9 99.9 1.8 3.1 98.0 97.41999 1.E−05 147 0.29 0.42 99.71 99.64   .08  .09 99.9 99.9 1.9 3.2 97.997.3 HG001   67* 1.E−05 HG002 58 0.32 0.51 99.68 99.59   .11  .15 99.899.8 2.1 3.5 97.8 97.1  999 1.E−03 107 0.30 0.49 99.70 99.61   .11  .1499.8 99.8 1.  3.4 97.9 97.2 1499 1.E−04 151 0.34 0.54 99.66 99.56   .11 .16 99.8 99.8 2.2 3.8 97.6 96.9 1999 1.E−05 147 0.37 0.54 99.63 99.55  .12  .15 99.8 99.8 2.4 3.8 97.4 96.7 HG002   66* 1.E−05 HG001 53 0.310.80 99.69 99.44   .08  .14 99.9 99.8 2.1 6.2 97.6 95.6  999 1.E−03 960.28 0.76 99.72 99.48   .07  .13 99.9 99.9 2.0 6.0 97.8 95.9 1499 1.E−04134 0.33 0.81 99.67 99.43   .08  .15 99. 99.8 2.3 6.3 97.4 95.5 19991.E−05 148 0.35 0.83 99.65 99.41   .08  .15 99.9 99.8 2.5 6.5 97.3 95.3HG002   66* 1.E−05 HG002 54 0.28 0.76 99.72 99.48   .07  .13 99.9 99.92.0 6.1 97.8 95.8  999 1.E−03 99 0.24 0.72 99.76 99.52   .06  .13 99.999.9 1.7 5.8 98.1 96.1 1499 1.E−04 124 0.27 0.72 99.73 99.50   .07  .1299.9 99.9 1.9 5.8 97.9 95.9 1999 1.E−05 132 0.28 0.73 99.72 99.50   .07 .12 99.9 99.9 2.0 5.9 97.8 95.9 DeepVariant 3 0.04 0.06 99.96 99.95 0.00.0 99.9 99.9 0.2 0.2 99.7 99.7 LoFreq Benchmarked SNP only 10    0.585.0 91.6 — — — — GATK UnifiedGenotyper, HG001 3 0.19 0.35 99.81 99.730.1 0.1 99.9 99.9 0.8 2.4 99.1 98.3 GATK HaplotypeCaller, HG001 4 0.070.11 99.93 99.91 0.0 0.0 99.9 99.9 0.5 0.6 99.4 99.4 GATKUnifiedGenotyper, HG002 3 0.20 0.37 99.80 99.71 0.1 0.1 99.8 99.8 0.72.5 99.2 98.3 GATK HaplotypeCaller, HG002 5 0.07 0.10 99.93 99.92 0.00.1 99.9 99.9 0.3 0.5 99.6 99.5 where * represents fast training mode.

TABLE 2 shows example results for the performance of the presentembodiments in the example experiment on PacBio data. In thisexperiment, the best accuracy was shown to be achieved by callingvariants in HG001 using the model trained on HG001 at 1499-epoch, with2.17% FPR, 6.06% FNR, 97.70% precision and 95.78% F1-score. Forcomparison, DeepVariant has been reported to have benchmarked the samedataset and reported 97.25% precision, 88.51% recall (11.49% FNR) and92.67% F1-score. The present benchmark differs from DeepVariant becauseIndels >4 bp have been removed (e.g. 52,665 sites for GRCh38 and 52,709for GRCh37 in HG001) from both the baseline and variant calls. If onewere to assume DeepVariant can identify all the 91 k Indels >4 bpcorrectly, the recall of DeepVariant would increase to 90.73% (9.27%FNR), which is still 3.21% lower than the present embodiments. The SNPcalling F1-score on PacBio data for the present embodiments topped at99.28% for HG001 and 99.30% for HG002, illustrating that the presentembodiments can be suitable for genotyping SNPs sensitively andaccurately at known clinically relevant or actionable variants usingPacBio reads in precision medicine applications.

TABLE 2 Over- Over- Ending Call Best Over- Over- all all SNP SNP IndelIndel Model Learning Vari- Variant all all Preci- F1 SNP SNP Preci- F1Indel Indel Preci- F1 Seq. Trained Trained Rate and ants Quality FPR FNRsion Score FPR FNR sion Score FPR FNR sion Score Tech. on Epochs Lambdain Cutoff % % % % % % % % % % % % PacBio HG001   50* 1.E−05 HG001 691.51 7.41 98.38 95.39 .32 1.4 99.6 99.1 10 60 76 97.3  999 1.E−03 941.39 7.07 98.51 95.64 .26 1.2 99.7 99.2 10 58 78.2 97.5 1499 1.E−04 892.17 6.06 97.70 95.78 .25 1.1 99.7 99.2 16 49 72.0 97.4 1999 1.E−05 852.43 5.81 97.43 95.78 .26 1.1 99.7 99.2 18 46 70.4 97.3 HG001   50*1.E−05 HG002 75 1.78 7.48 98.07 95.22 .71 1.4 99.2 98.9 10 60 77.7 97.1 999 1.E−03 96 1.98 7.31 97.87 95.21 .75 1.4 99.2 98.8 11 58 76.0 97.21499 1.E−04 114 2.07 7.77 97.76 94.91 .76 1.4 99.2 98.8 12 63 72.6 96.91999 1.E−05 123 1.97 7.94 97.86 94.87 .75 1.4 99.2 98.9 11 64 73.0 96.7HG002   72* 1.E−05 HG001 56 1.63 9.22 98.20 94.35 .49 2.5 99.4 98.4 1068 72.4 95.6  999 1.E−03 99 1.69 8.47 98.16 94.73 .57 1.8 99.4 98.7 1067 73.3 95.9 1499 1.E−04 116 2.43 8.25 97.36 94.47 .80 1.9 99.1 98.6 1464 66.9 95.5 1999 1.E−05 127 2.34 8.57 97.45 94.34 .89 2.0 99.0 98.5 1366 68.0 95.3 HG002   72* 1.E−05 HG002 55 1.88 7.08 97.98 95.38 .55 1.299.4 99.1 12 58 75.2 95.8  999 1.E−03 88 1.86 6.59 98.01 95.66 .49 1.199.5 99.1 12 54 76.3 96.1 1499 1.E−04 101 2.02 5.81 97.85 95.99 .42 1.099.5 99.2 14 47 76.1 95.9 1999 1.E−05 101 2.05 5.70 97.83 96.03 .41 0.999.5 99.3 14 46 75.9 95.9 GATK UnifiedGenotyper, HG001 3 0.83 99.92 8.190.15 .94 99 8.19 0.17 — — — — GATK HaplotypeCaller, HG001 4 0.08 97.9152.26 4.02 .06 97 61.1 4.53 0.6 99 0.39  0.04 GATK UnifiedGenotyper,HG002 3 0.75 99.91 9.91 0.17 .85 99 9.91 0.19 — — — — GATKHaplotypeCaller, HG002 5 0.69 98.74 63.81 2.47 .79 98 63.8 2.78 .02 100 15.6  0.00 where * represents fast training mode.

TABLE 3 shows example results for the performance of the presentembodiments in the example experiment on ONT data. As there are nopublicly available deep coverage ONT datasets for HG002, the exampleexperiment benchmarked variant calls in the chromosome 1 of HG001 usingmodels trained on all chromosomes of HG001 except for the chromosome 1.The best F1-score, 90.53%, was witnessed to be achieved at 1999-epoch,with FPR 2.96%, FNR 14.78% and precision 96.55%. For comparison, thepresent inventors applied Nanopolish to the whole genome of HG001, using24 CPU cores and a peak of 160 GB memory, and determined that Nanopolishfinished in 4.5 days and achieved 0.04%, 21.57%, 97.51%, and 86.93% onFPR, FNR, precision and F1-score, respectively.

TABLE 3 Over- Over- Ending Call Best Over- Over- all all SNP SNP IndelIndel Model Learning Vari- Variant all all Preci- F1 SNP SNP Preci- F1Indel Indel Preci- F1 Seq. Trained Trained Rate and ants Quality FPR FNRsion Score FPR FNR sion Score FPR FNR sion Score Tech. on Epochs Lambdain Cutoff % % % % % % % % % % % % ONT HG001  110* 1.E−05 HG001 47 3.4017.09 95.93 88.94 3.3 9.2 96.3 93.4 3.9 86.4 76.59 23.07 (except  9991.E−03 (chr1) 53 4.05 16.31 95.20 89.07 3.7 8.9 95.8 93.3 6.2 81.7 73.1929.28 chr1) 1499 1.E−04 70 2.95 14.99 96.55 90.41 2.5 7.7 97.2 94.6 6.379.2 75.46 32.51 1999 1.E−05 74 2.96 14.78 96.55 90.53 2.5 7.5 97.2 94.76.6 78.6 74.90 33.24 Nanopolish, HG001 20 0.04 21.57 97.51 86.93  .03 1598.1 90.7 0.1 63.4 88.57 51.73 DeepVariant (chr1) 3 0.22 23.82 96.2685.05  .21 15 96.7 90.3 0.3 89.1 72.98 18.95 LoFreq — Benchmarking SNPonly 2.7 46.9 90.6 66 — — — GATK UnifiedGenotyper, HG001 1 3.66 84.4480.07 26.05 4.1 82 80.0 28.7 — — — — GATK HaplotypeCaller, HG001 1 0.4198.65 76.04 2.65  .45 98 76.7  2.9 0.1 99.9  9.59  0.03 Nanopolish,afcut0.2, depthcut4, chr19 20 0.15 34.13 90.00 76.07  .08 27 94.5 82.20.7 83.0 36.45 23.13 Nanopolish, 1 kgp3, chr19 20 0.08 22.71 95.28 85.35 .05 16 96.8 89.7 0.3 66.7 73.64 45.79 where * represents fast trainingmode.

Beyond benchmarking variants at sites known to be variable in a sample,the example experiments also benchmarked performance of the presentembodiments on calling variants genome-wide. Calling variantsgenome-wide is challenging because it tests not only how good the system100 can derive the correct variant type, zygosity and alternative alleleof a variant when evidence is marginal, but also in reverse, how goodthe system 100 can filter/suppress a non-variant even in the presence ofsequencing errors or other artificial signals. Instead of naivelyevaluating all three billion sites of the whole genome, the exampleexperiment tested the performance at different alternative allelecutoffs for all three sequencing technologies. As expected, a higherallele cutoff speeds up variant calling by producing fewer candidates tobe tested but worsens recall especially for noisy data like PacBio andONT. The example experiments provide a reference point on how to choosea cutoff for each sequencing technology to achieve a good balancebetween recall and running speed. Precision, Recall and F1-score metricswere used but FPR (calculated as FP÷(FP+TN)) was not used because FPbecomes negligibly small compare to TN, which is around three billion inhuman genome-wide variant identification. All models were trained for1000 epochs with learning rate at 1 e-3. The example experiments wereperformed on two Intel Xeon E5-2680 v4 using all 28 cores.

Example results for the performance of the present embodiments in theexample experiments on calling variants genome-wide are shown in TABLE4. As expected, with higher alternative allele frequency threshold(0.2), the precision was higher while the recall and time consumptionwas reduced in all experiments. In this example experiment, for Illuminadata, the best F1-score (with 0.2 allele frequency) achieved for thepresent embodiments was 98.65% for HG001 and 98.61% for HG002. Theruntime varied between half and an hour (40 minutes for the bestF1-score). While GATK HaplotypeCaller achieved highest performance onIllumina data, achieving an F1-score 99.76% for HG001 and 99.70% forHG002, it required a runtime of about 8 hours. Inspecting the falsepositive and false negative variant calls for the present embodiments,it was determined that it was about 0.19% for FP and 0.15% for FN.

TABLE 4 Train Alt. Best using Call Allele Variant Time Overall SNP IndelSeq. Variants Variants Freq. Quality Consump- Preci- Recall F1 Preci-Recall F1 Preci- Recall F1 Tech in in Cutoff Cutoff tion sion % % Score% sion % % Score % sion % % Score % Illumina HG001 HG001 0.1 189 1:0898.2 98.9 98.6 98.1 99.9 99.0 98.6 96.9 97.7 0.2 182 0:43 98.4 98.9 98.798.4 99.9 99.1 98.7 95.1 96.8 0.25 180 0:26 98.7 98.0 98.3 98.7 99.899.2 98.6 87.3 92.6 HG002 0.1 192 1:11 98.1 98.8 98.5 98.1 99.8 98.998.5 96.5 97.5 0.2 183 0:41 98.4 98.8 98.6 98.3 99.8 99.0 98.5 94.9 96.60.25 182 0:30 98.7 97.9 98.3 98.7 99.7 99.1 98.4 87.3 92.5 HG002 HG0010.1 198 1:16 98.6 98.5 98.5 98.7 99.9 99.2 97.9 93.3 95.6 0.2 192 0:4798.8 98.4 98.6 98.8 99.9 99.3 98.0 91.6 94.7 0.25 184 0:25 98.9 97.698.3 99.1 99.8 99.4 97.9 85.0 91.0 HG002 0.1 195 1:07 98.5 98.6 98.698.6 99.9 99.2 98.1 93.7 95.9 0.2 188 0:44 98.7 98.5 98.6 98.8 99.8 99.298.2 92.2 95.1 0.25 182 0:25 99.0 97.7 98.3 99.0 99.7 99.3 98.1 85.791.5 GATK UnifiedGenotyper, 51 0:46 99.4 99.4 99.4 99.4 99.4 99.5 99.996.4 97.6 HG001 GATK HaplotypeCaller, 5 8:45 99.6 99.8 99.7 99.8 99.899.8 100 98.9 99.0 HG001 GATK UnifiedGenotyper, 4 0:46 98.7 99.4 98.899.4 99.1 98.7 99.8 96.5 97.7 HG002 GATK HaplotypeCaller, 5 8:23 99.599.8 99.6 99.8 99.7 99.6 99.9 99.2 99.3 HG002 PacBio HG001 HG001 0.1 1579:46 96.3 88.6 92.3 96.7 99.5 98.0 79.1 31.1 44.6 0.2 130 3:53 98.1 87.692.6 99.0 96.6 97.7 75.8 31.5 44.5 0.25 125 2:01 98.6 83.1 90.2 99.491.4 95.2 78.5 27.1 40.3 HG002 0.1 153 9:24 97.0 89.1 92.9 97.9 99.298.5 71.5 34.2 46.3 0.2 132 3:34 97.9 88.3 92.9 99.0 97.1 98.0 73.3 34.046.5 0.25 116 1:46 98.1 84.7 90.9 99.2 92.5 95.7 75.5 31.2 44.2 HG002HG001 0.1 163 14:55  95.6 86.7 90.9 96.6 99.0 97.7 59.1 24.5 34.6 0.2147 3:29 97.5 85.6 91.2 98.9 96.1 97.5 58.2 23.6 33.6 0.25 139 1:39 98.281.5 89.0 99.3 90.9 94.9 66.3 21.1 32.0 HG002 0.1 150 15:31  97.1 89.393.0 98.3 99.1 98.7 69.1 39.7 50.5 0.2 134 3:34 98.1 88.5 93.1 99.4 97.398.3 72.2 36.5 48.5 0.25 118 1:46 98.2 84.8 91.0 99.5 92.8 96.0 72.931.4 43.9 Oxford HG001 HG001 0.1 140 13:01  86.2 71.0 77.9 86.8 91.989.2 55.3 10.6 17.9 Nanopore 0.2 139 4:47 87.2 70.2 77.8 87.7 88.0 87.859.0 9.90 16.9 (rel3) 0.25 136 2:40 87.8 68.5 77.0 88.3 82.9 85.4 59.49.26 16.0 0.35 130 1:30 91.0 57.4 70.4 91.3 65.8 76.5 67.3 6.62 12.0Oxford HG001 HG001 0.2 162 5:53 88.8 85.8 87.3 88.9 93.8 91.2 72.1 8.3214.9 Nanopore 0.25 159 3:12 89.1 82.2 85.5 89.3 90.1 89.7 72.4 8.02 14.4(rel5) 0.35 148 1:51 91.2 67.9 77.8 91.5 75.2 82.5 71.2 6.54 11.9

Advantageously, as described herein, the neural network architecture ofthe present embodiments can be used as an orthogonal method to validatea variant for filtering or validating results to further increaseaccuracy.

In this example experiment, for the PacBio, the best F1-scores were alsoachieved at 0.2 allele frequency cutoff. The best F1-score is 92.57% forHG001 and 93.05% for HG002 running for ˜3.5 hours. In contrast,DeepVariant has been reported to achieve 35.79% F1-score (22.14%precision, 93.36% recall) on HG001 with PacBio. The runtime for thepresent embdoiments at 0.25 frequency cutoff is about 2 hours, which isabout half the time consumption at 0.2 frequency cutoff, and about ⅕ thetime consumption at 0.1 frequency cutoff. In this example experiment,for ONT (rel3), the best F1-score 77.89% was achieved at 0.1 frequencycutoff. However, the F1-score at 0.25 frequency cutoff is just slightlylower (76.95%), but ran about five times faster, from 13 hours to lessthan three hours. In these cases, the present inventors have determinedthat using 0.25 may be ideal as the frequency cutoff. The runtime is onaverage about 1.5 times longer than PacBio, due to the higher level ofnoise in data. Using the new rel5 ONT with better base calling quality,the best F1-score has increased from 87.26% (9.37% higher than rel3).The recall of SNP and the precision of Indel were the most substantiallyincreased.

The example experiments also benchmarked against a newer version ofDeepVariant (v0.6.1). Generally, DeepVariant requires base-quality, thusit generally fails on the PacBio dataset, in which base-quality is notprovided. On ONT data (rel5), DeepVariant performed much better than thetraditional variant callers that were not designed for long reads, butit performed worse than the present embodiments. It was also found thatDeepVariant's computational resource consumption on long reads isprohibitively high and, in the example experiments, it was only able tocall variants in few chromosomes. For the example experiments using thenewer version of Deep Variant, using transfer-learning, the presentinventors trained two models for ONT data on chromosome 1 and 21respectively, and called variants in chromosome 1 and 22 against thedifferent models. In total three settings were benchmarked: 1) callvariants in chromosome 1 against the chromosome 21 model, 2) callvariants in chromosome 22 against the chromosome 21 model, and 3) callvariants in chromosome 22 against the chromosome 1 model. Training themodels required about 1.5 days until the validation showed a decreasingF1-score with further training. Using 24 CPU cores, the first step ofvariant calling generated 337 GB candidate variants data in 1,683minutes for chromosome 1 and generated 53G data in 319 minutes forchromosome 21. The second step of variant calling took 1,171 and 213minutes to finish for chromosome 1 and 22, respectively. The last steptook 160 minutes and was very memory intensive, requiring 74 GB of RAMfor chromosome 1. In terms of F1-score, DeepVariant has achieved 83.05%in chromosome 1, and 77.89% in chromosome 22, against the model trainedon chromosome 21. The present inventors verified that more samples formodel training do not lead to better variant calling performance, usingthe model trained on chromosome 1, the F1-score dropped slightly to77.09% for variants in chromosome 22. Using the computational resourceconsumption on chromosome 1, it is estimated that the newer version ofDeepVariant would require 4 TB storage and about one month for wholegenome variant calling of a genome sequenced with long reads.

The example experiments further benchmarked three additional variantcallers, including Vardict (v20180724), LoFreq (v2.1.3.1), and FreeBayes(v1.1.0-60-gc15b070). Vardict requires base quality, thus failed on thePacBio dataset, in which base quality is not provided. Vardictidentified only 62,590 variants in the ONT dataset, among them only 231variants are true positives. The results suggest Vardict is not yetready for Single Molecule Sequencing long reads. To enable Indel callingin LoFreq, BAQ (Base Alignment Quality) needs to be calculated inadvance. However, the BAQ calculation works only for Illumina reads,thus for LoFreq, it was only benchmarked in SNP calling. Meanwhile,LoFreq does not provide zygosity in the result, and thus the exampleexperiments were prohibited from using “RTG vcfeval” for performanceevaluation. Thus, a true positive in LoFreq was considered to be amatched truth record in 1) chromosome, 2) position and 3) alternativeallele. LoFreq requires base quality, thus failed on the PacBio dataset,in which base quality is not provided. The results suggest that LoFreqis capable of SNP detection in Single Molecule Sequencing long reads.The example experiments were unable to finish running Freebayes on boththe PacBio dataset and the ONT dataset after they failed to complete oneither dataset after running for one month. According to the percentageof genome covered with variant calls, it is estimated that severalmonths, 65 and 104 machine days on a latest 24-core machine, arerequired for a single PacBio and ONT dataset, respectively.

GIAB datasets were constructed from a consensus of multipleshort-variant callers, thus tend to bias toward easy regions that areaccessible. So, the example experiments also benchmarked the Syndipdataset, which is a benchmark dataset from the de novo PacBio assembliesof two homozygous human cell lines. The dataset provides a relativelymore accurate and less biased estimate of small-variant-calling errorrates in a realistic context. The results show that, when using Syndipvariants for training, the performance of calling variants in both HG001and HG002 at known variants are good. However, using the same model(Syndip), the performance dropped both at the Syndip known sites(excluding variants >4 bp, from 99.51% (HG001) to 98.52%) and for thewhole genome (excluding variants >4 bp, from 94.88% (HG001) to 94.02%).The results support that Syndip contains variants that are harder toidentify. In this way, in some cases, Syndip may be included whentraining models for the present embodiments to improve performance inthe hard regions.

The truth SNPs and Indels provided by GIAB were intensively called andmeticulously curated, and the accuracy and sensitivity of the GIABdatasets are unmatched. However, since the GIAB variants were generatedwithout incorporating any SMS technology, it is possible that we canconsummate GIAB by identifying variants not yet in GIAB, butspecifically detected both by using the PacBio and the ONT data. For theHG001 sample (variants called in HG001 using a model trained on HG001),we extracted the so-called “false positive” variants (identifiedgenome-wide with a 0.2 alternative allele frequency cutoff) called inboth the PacBio and ONT dataset. Then we calculated the geometric meanof the variant qualities of the two datasets, and we filtered thevariants with a mean quality lower than 135 (calculated as the geometricmean of the two best variant quality cutoffs, 130 and 139). Theresulting catalog of 3,135 variants retained are listed in SupplementaryData 1. 2,732 are SNPs, 298 are deletions, and 105 are insertions. Amongthe SNPs, 1,602 are transitions, and 1,130 are transversions. The Ti/Tvratio is ˜1.42, which is substantially higher than random (0.5),suggesting a true biological origin. The top ten variants in quality wasmanually inspected using IGV to determine their authenticity (asillustrated in FIG. 5A). Among the ten variants, there is one convincingexample at 2:163,811,179 (GRCh37) that GIAB has previously missed.Another seven examples have weaker supports that need to be furthervalidated using other orthogonal methods. Possible artifactsincluding 1) 7:89,312,043 has multiple SNPs in its vicinity, which is atypical sign of false alignment, 2) 1:566,371 20:3,200,689 (asillustrated in FIG. 5A) are located in the middle of homopolymerrepeats, which could be caused by misalignment, 3) X:143,214,235 showssignificant strand bias in Illumina data, and 4) X:140,640,513,X:143,218,136, and 9:113,964,088 are potential heterozygous variants butwith allele frequency notably deviated from 0.5. Two examples are usedbecause of the difference in representation: 13:104,270,904 and10:65,260,789 have other GIAB truth variants in their 5 bp flankingregions. The example experiments show that the GIAB datasets are ofsuperior quality and are the enabler of machine learning baseddownstream applications such as the present embodiments.

FIGS. 5A to 5D illustrate an Integrative Genomics Viewer (IGV) screencapture of selected variants. FIG. 5A illustrates a heterozygote SNPfrom T to G at chromosome 11, position 98,146,409 called only in thePacBio and ONT data, FIG. 5B illustrates a heterozygote deletion AA atchromosome 20, position 3,200,689 not called in all three technologies,FIG. 5C illustrates a heterozygote insertion ATCCTTCCT at chromosome 1,position 184,999,851 called only in the Illumina data, and FIG. 5Dillustrates a heterozygote deletion G at chromosome 1, position5,072,694 called in all three technologies. The tracks from top to downshow the alignments of the Illumina, PacBio, and ONT reads from HG001aligned to the human reference GRCh37.

The present inventors, as part of the example experiments, also analyzedwhy the PacBio and ONT technologies cannot detect some variants. FIG. 6shows the number of known variants undetected by different combinationsof sequencing technologies. The genome sequence immediately after thevariants was inspected and it was found that among the 12,331 variantsundetected by all three sequencing technologies, 3,289 (26.67%) arelocated in homopolymer runs, and 3,632 (29.45%) are located in shorttandem repeats. Among the 178,331 variants that cannot be detected byPacBio and ONT, 102,840 (57.67%) are located in homopolymer runs, and33,058 (18.54%) are located in short tandem repeats. For Illustration,FIG. 5B depicts a known variant in homopolymer runs undetected by allthree sequencing technologies, FIG. 5C depicts a known variant in shorttandem repeats that cannot be detected PacBio and ONT, and FIG. 5Ddepicts a known variant flanked by random sequenced detected by allthree sequencing technologies.

As evidenced by the example experiments, the present embodiments of amulti-task convolutional deep neural network for variant calling usingsingle molecule sequencing has a performance at least on-par with GATKUnifiedGenotyper on Illumina data and outperforms Nanopolish andDeepVariant on PacBio and ONT data. Advantageously, the presentembodiments have been experimentally verified to be the first method forSMS to finish a whole genome variant calling within two hours on asingle CPU-only server, while providing better precision and recall thanother variant callers such as Nanopolish. For the well-characterizedNA12878 human sample, the present embodiments achieve 99.67%, 95.78%,90.53% F1-score on 1KP common variants, and 98.65%, 92.57%, 87.26%F1-score for whole-genome analysis, using Illumina, PacBio, and OxfordNanopore data, respectively. Training on a second human sample shows thepresent embodiments are sample agnostic and finds variants in less thantwo hours on a standard server.

In further embodiments, an artificial neural network can be used as adiscriminator as a substitute to expert review on clinically significantgenomics variants. Many rare diseases and cancers are fundamentallydiseases of the genome. Genome sequencing has become one of the mostimportant tools in clinical practice for rare disease diagnosis andtargeted cancer therapy. However, variant interpretation remains asignificant bottleneck due to lack of sufficient automation and becauseit may take a specialist several hours of work per patient. On average,one-fifth of this time is spent on visually confirming the authenticityof the candidate variants. Embodiments of the present disclosure havebeen experimentally verified by the present inventors to run in lessthan one minute to automatically review ten thousand variants, andapproximately 30 minutes to review all variants in a typicalwhole-genome sequencing sample. Among the false positive singletonsidentified by GATK HaplotypeCaller, UnifiedGenotyper and 16GT in theHG005 GIAB sample, 79.7% were determined to be rejected by the presentembodiments. Worked on the Variants with Unknown Significance (VUS), thepresent embodiments were able to mark most of the false positivevariants for manual review and most of the true positive variants as noneed for review.

The dramatic reduction in the cost of whole genome, exome and ampliconsequencing has allowed these technologies to be increasingly accessiblefor genetic testing, opening the door to broad applications in Mendeliandisorders, cancer diagnosis and personalized medicine. However,sequencing data include both systematic and random errors that hinderany of the current variant identification algorithms from workingoptimally. Using most approaches, typically 1-3% of the candidatevariants are false positives with Illumina sequencing. With the help ofa genome browser such as IGV, or web applications such as VIPER, aspecialist can visually inspect a graphical layout of the readalignments to assess supporting and contradicting evidence to make anarbitration. Though necessary, this is a tedious and fallible procedurebecause of three major drawbacks: 1) it is time-consuming and empiricalstudies report it requires about one minute per variant, sometimessumming up to a few hours per patient; 2) it is tedious, not infallible,and even experienced genetic-specialists might draw differentconclusions for a candidate variant with limited or contradictingevidence; and 3) there is no agreed standard between genetic-specialiststo judge various types of variants, including SNPs (Single NucleotidePolymorphisms) and Indels. A specialist might be more stringent on SNPsbecause there are more clinical assertions and fewer candidate SNPs willbe less likely to get contradicting medical conclusions, whereas anotherspecialist might be more demanding on indels because they are rarer andharder to be identified.

The present embodiments advantageously provide an efficient, accurateand consistent computational method that automates assessing thecandidate variants. Additionally, the validation method of the presentembodiments is orthogonal, i.e., independent of the algorithms used toidentify the candidate variants. Additionally, the validation method ofthe present embodiments is able to capture the complex non-linearrelationship between read alignments and the authenticity of a variantfrom a limited amount of labeled training data. Variant validation is atask with a different nature from variant filtration. The presentembodiments are able to indicate the need of a variant being manuallyreviewed, as opposed to a hard filter that removes a variant fromconsideration. In some cases, failing to flag a false positive variantfor review is less favorable than flagging a true variant for manualreview, i.e., as a validation method, the precision must be maximized,and false positives must be minimized. Consequently, instead of usinghand-coded models or rule-based learning, the present embodiments employa more powerful and agnostic machine learning approach such as anartificial neural network.

Embodiments of the system 100 can be used as an orthogonal approach forvalidating of candidate variants from a separate variant caller. Inthese embodiments, the system 100 uses the discriminator module 156 toprovide fast and accurate validation of candidate variants from theseparate variant caller. The separate variant caller can include; forexample, GATK HaplotypeCaller, Freebayes, or the like. The discriminatormodule 156 can automatically determine whether the evidence supports orcontradicts sequencing read alignments output by the separate variantcaller. Using the CNN described herein, a probability of each possibleoption for multiple categories can be generated by the CNN. Thecategories include 1) variant type, 2) alternative allele, 3) zygosity,and 4) indel-length. A candidate variant output by the variant callercan then be compared to a prediction on each category. The discriminatormodule 156 will agree with a variant if all categories are matched butwill reject and provide possible corrections if any category isunmatched. The CNN model for the discriminator can be trained usingknown variants having known output variables, as described herein. Theknown variants can be sourced from, for example, sequencing librariessuch as those prepared by either a polymerase chain reaction (PCR) orthe PCR-free protocol. With a trained model, the discriminator module156 accepts, from the separate variant caller, an input with candidateSNPs and Indels (for example, as a Variant Call Format (VCF) file) andan input with read alignments (for example, as a Binary Alignment/Map(BAM) file). The discriminator module 156 outputs a judgment and aquality score on how confident the judgment was made for each candidatevariant. In an example, the discriminator module 156 can use Tensorflow,in some cases, that has been tuned to maximize its speed.

Turning to FIG. 7 , a flowchart for a method of discriminating genomicvariants, according to an embodiment, is shown.

At block 702, the input module 150 obtains genomic data, for example,from the network interface 110, the input interface 106, or the database116.

At block 703, the input module 150 receives candidate variant data thatis outputted from a separate variant caller. The variant data isassociated with the genomic data and includes candidate variants andproperties of the candidate variants including all, or a subset of,categories including: 1) variant type, 2) alternative allele, 3)zygosity, and 4) indel-length.

At block 704, the input module 150 extracts overlapping sequencing readalignments from the genomic data and determines variants where the readalignments differ.

At block 706, the input module 150 transforms the overlapping sequencingread alignments and associated variants into a multi-dimensional tensor.In a particular case, a shape of the multi-dimensional tensor can havethree dimensions; a first dimension corresponding to a position of avariant, a second dimension corresponding to a count of a base (A, C, G,or T) on a sequencing read, and a third dimension corresponding todifferent ways of counting bases.

At block 708, the machine learning module 152 passes themulti-dimensional tensor through a trained machine learning architectureto generate probabilities for outcomes in each category of outputvariables by minimizing a cost function iterated over each variant. Thecategorical output variables include alternate base at asingle-nucleotide polymorphism, zygosity of a variant, variant type, andlength of an indel. The machine learning architecture is trained using agenomic dataset with previously identified variants (i.e., truthvariants). The trained machine learning architecture can have the samestructure as the other embodiments described herein.

At block 710, the discriminator module 156 compares each of one or morecandidate variants against the machine learning model's prediction oneach category for that candidate variant. The discriminator module 156will output a positive condition for a variant if all categories arematched and will output a negative condition if any category isunmatched. In some embodiments, the discriminator module 156 alsooutputs the output condition and a quality score on how confident thejudgment was made for each candidate variant. In a particular case, thequality score can be determined using a Phred-scale of the distancebetween the best prediction and the second-best prediction of themachine learning model. The best prediction being an average of the bestprobability of the four categories. The second-best prediction being anaverage of the second-best probability of the four categories.

At block 712, the output module 154 outputs the output condition, and insome cases, the quality score. In some cases, the output module 154 alsooutput the categorical output variables outputted by the trained machinelearning architecture.

In example experiments, the present inventors verified the ability ofembodiments of the system 100 to validate candidate variants. Theexample experiments used four deeply Illumina sequenced genomes (HG001,HG002, HG003, and HG004) with 13.5 M known truth variants from theGenome In A Bottle (GIAB) database. Using the truth variants, the neuralnetwork was trained to recognize how the truth variants are differentfrom another 20 M non-variants that were randomly sampled from the fourgenomes. For benchmarking and identifying the false positive variantcalls, the known truth variants in HG005 were used; which were notincluded in the model training. A false positive variant is defined as avariant called by a variant caller but cannot be found in the HG005 GIABtruth dataset and will be used for the subsequent analysis. It washypothesized that the false positive variants that are supported by onlyone variant caller, but not the other variant callers are very likely tobe erroneous and should be marked for manual review (i.e., rejected) bythe system 100. Thus, variants were called using three different variantcallers with different calculation models, including GATKHaplotypeCaller (HC), GATK UnifiedGenotyper (UG) and 16GT. FIG. 8illustrates a Venn diagram of a variant set called by the three callers,which comprise seven different types of variants: 1) three types ofsingleton variants that have support from only one caller, 2) threetypes of doubleton variants that have support from two of the threecallers, and 3) one type of tripleton variant that is supported by allthree callers. Empirically, doubleton and especially tripleton variantsare relatively less likely to be real false positives and should be lesslikely to be rejected by the system 100. Conversely, singletons calledby only one caller are more likely to be genuine false positive andshould be more likely to be rejected by the system 100.

The results of the example experiments are shown in FIG. 8 , showing thevariant calling results of GATK HaplotypeCaller, GATK UnifiedGenotyper,and 16GT. The Venn diagram shows 1) the precision rate (P), recall rate(R) and f1-score (F) of each variant caller on all variants of theentire HG005 genome, and 2) the number of false positive variantsproduced by each variant caller. The bars graphs shows the number offalse positive variants rejected or agreed by the system 100. The barlength is proportionate to the total number of false positive variantsin that type. Only 18.64% of the tripleton variants were rejected while79.70% of the singletons were rejected by the system 100. Thosedoubletons have an intermediate 45.11% rejected by the system 100. Inthe true positive variants, only 1,879/3,232,539 (0.058%) in HC,43/2,902,052 (0.0014%) in UG, and 124/3,228,537 (0.0038%) in 16GT wererejected. By deducting the rejected variants from both the number oftrue positives and true negatives, the precision increased from 99.77%to 99.92% for HC, 99.50% to 99.58% for UG and 99.51% to 99.84% for 16GT.

In other example experiments, the present inventors verified the abilityof embodiments of the system 100 to validate candidate variants. Insteadof fully automatizing and thus removing manual review, which may not beallowed in a stringent clinical context that emphasizes accountability,the system 100 can be used to help medical professionals prioritizewhich variants they should likely invest efforts in furtherinvestigation and lab validation. In practice, those variantscategorized as “Pathogenic” or “Likely Pathogenic” are rare and shouldbe given priority, thus all these variants are preferred to be manuallyreviewed. “Benign”, and most of the time together with the “LikelyBenign” category, suggest variants without much value in clinicaldiagnosis and therapy, thus likely not requiring manual review. Theother category, named Variant of Unknown Significance, or VUS, containsvariants that are potentially impactful, and requires doctors to sortthrough them. With some approaches, the number of VUS can typically betens to even hundreds of time larger than the sum of other categories.Thus, the present embodiments can benefit the clinical practitionersbecause it has been found to significantly decrease the number VUS to bemanually reviewed.

In the example experiments, to assess the intended function, GATKHaplotypeCaller was run on the HG002 sample. In total about 5 M variantswere called. Then the variants were annotated using SeattleSeq version151 (with dbSNP v151). Variants were extracted that were 1) not in dbSNP(RSID tag equals to 0), and 2) are in a human gene (GL tag not empty).Then, the system 100 was run on the extracted variants with a modeltrained on four samples including HG001, HG003, HG004, and HG005, andannotated the variants as either true positive (TP) or false positive(FP) against the HG002 GIAB truth dataset. The system 100 performedsubstantially well, and the results are shown in TABLE 5.

TABLE 5 PASS CHECK # % # % TP SNP 4,837 99.7% 14 0.3% Indel 7,126 74.5%2,434 25.5% FP SNP 117 46.6% 134 53.4% Indel 41 21.7% 148 78.3%

In the example experiments, for SNPs, 53.4% of the FPs were flagged formanual review, while only 0.3% of the TPs were flagged. For Indels,78.3% of the FPs were flagged for manual review, while only 25.5% of theTPs were flagged. A higher rate of TP Indels were flagged for manualreview because longer Indels are usually more error-prone and can leadto more severe clinical consequences than SNPs, thus it was decided torequire all Indels to be manually reviewed. As shown, the exampleexperiments illustrate the system 100 has significantly more FP variantsmarked for manual review than TP variants, verifying the system's 100effectiveness.

Advantageously, the present embodiments can relieve users from a heavymanual review workload without compromising accuracy. In some cases,instead of taking over the review of all variants, the system 100 can beconfigured to review only 1) SNPs with a single alternative allele, and2) Indels ≤4 bp. The system 100 can also output a quality score rangingfrom 0 to 999 to indicate the confidence of a judgment. Advantageously,the system 100 can require less than a gigabyte of memory and less thana minute on one CPU core to review ten thousand variants, thus can beeasily integrated into existing manual review workflows with minimalcomputational burden. Using 24 CPU cores, the system 100 was able toreview all five million whole genome sequencing variants of the HG002sample in 30 minutes. Advantageously, the present embodiments greatlyreduce the workload on reviewing variants, and thus can greatly increasethe productivity of genetic-specialists in clinical practice.

Although the invention has been described with reference to certainspecific embodiments, various modifications thereof will be apparent tothose skilled in the art without departing from the spirit and scope ofthe invention as outlined in the claims appended hereto.

The invention claimed is:
 1. A computer-implemented method for variantcalling in single molecule sequencing from a genomic dataset using aconvolutional deep neural network, the method comprising causing one ormore processors of the computer to perform steps of: obtaining variantsof the genomic dataset; transforming properties of each of the variantsinto a multi-dimensional tensor; passing the multi-dimensional tensorthrough a convolutional deep neural network trained to predictcategorical output variables, the convolutional deep neural networkminimizing a cost function iterated over each variant, the convolutionaldeep neural network trained using a training genomic dataset comprisingpreviously identified variants, the convolutional neural networkcomprising: an input layer; a plurality of pooled convolutional layersconnected sequentially after the input layer, each pooled convolutionallayer taking an input, applying at least one convolutional operation tothe input, and applying a pooling operation to the output of theconvolutional operation; and at least two fully-connected layersconnected sequentially after the last of the pooled convolutionallayers, the at least two fully-connected layers comprising a secondfully-connected layer connected sequentially after a firstfully-connected layer; and outputting the predicted categorical outputvariables; wherein, for each variant, the categorical output variablescomprise: a first categorical output variable comprising an alternatebase at a single-nucleotide polymorphism; a second categorical outputvariable comprising zygosity; a third categorical output variablecomprising type; and a fourth categorical output variable comprisinglength of an insertion or deletion of bases; the first categoricaloutput variable is selected from a group comprising adenine (A),cytosine (C), guanine (G), and thymine (T); the second categoricaloutput variable is selected from a group comprising Homozygote andHeterozygote; the third categorical output variable is selected from agroup comprising Reference, SNP, Insertion, and Deletion; the fourthcategorical output variable is selected from a group comprising 0, 1, 2,3, 4, and greater than 4; the first categorical output variable isoutput from the first fully-connected layer, and the second categoricaloutput variable, the third categorical output variable, and the fourthcategorical output variable are outputted from the secondfully-connected layer; and the plurality of pooled convolutional layerscomprises three pooled convolutional layers.
 2. The method of claim 1,wherein the first fully-connected layer comprises a regression analysisusing a sigmoid activation function, and the second fully-connectedlayer comprises a softmax classification analysis.
 3. The method ofclaim 1, wherein the plurality of pooled convolutional layers comprisesexactly three pooled convolutional layers.
 4. A computer-implementedmethod for variant calling in single molecule sequencing from a genomicdataset using a convolutional deep neural network, the method comprisingcausing one or more processors of the computer to perform steps of:obtaining variants of the genomic dataset; transforming properties ofeach of the variants into a multi-dimensional tensor; passing themulti-dimensional tensor through a trained convolutional deep neuralnetwork to predict categorical output variables, the convolutional deepneural network minimizing a cost function iterated over each variant,the convolutional deep neural network trained using a training genomicdataset comprising previously identified variants, the convolutionalneural network comprising: an input layer; a plurality of pooledconvolutional layers connected sequentially after the input layer, eachpooled convolutional layer taking an input, applying at least oneconvolutional operation to the input, and applying a pooling operationto the output of the convolutional operation; and at least twofully-connected layers connected sequentially after the last of thepooled convolutional layers, the at least two fully-connected layerscomprising a second fully-connected layer connected sequentially after afirst fully-connected layer; and outputting the predicted categoricaloutput variables; wherein, for each variant, the categorical outputvariables comprise: a first categorical output variable comprising analternate base at a single-nucleotide polymorphism; a second categoricaloutput variable comprising zygosity; a third categorical output variablecomprising type; and a fourth categorical output variable comprisinglength of an insertion or deletion of bases; and the multi-dimensionaltensor comprises a first dimension corresponding to a position of thevariant in addition to flanking base pairs, a second dimensioncorresponding to a count of a base on a sequencing read, and a thirddimension corresponding to a number of techniques for counting bases. 5.The method of claim 4, wherein the first dimension comprises sixteenflanking base pairs on both sides of the variant for a total dimensionof thirty-three, the second dimension comprises adenine (A), cytosine(C), guanine (G), and thymine (T) for a total dimension of four, and thethird dimension comprises four techniques for counting bases for a totaldimension of four.
 6. The method of claim 5, wherein the four techniquescomprise a first technique for encoding a reference sequence and anumber of reads supporting reference alleles, and the second techniqueencodes inserted sequences, the third technique encodes deletedbase-pairs, and the fourth technique encodes alternative alleles.
 7. Asystem for variant calling in single molecule sequencing from a genomicdataset using a convolutional deep neural network, the system comprisingone or more processors and one or more non-transitory computer storagemedia, the one or more non-transitory computer storage media causing theone or more processors to execute: an input module to obtain variants ofthe genomic dataset and transform properties of each of the variantsinto a multi-dimensional tensor; a machine learning module to pass themulti-dimensional tensor through a convolutional deep neural networktrained to predict categorical output variables, the convolutional deepneural network minimizing a cost function iterated over each variant,the convolutional deep neural network trained using a training genomicdataset comprising previously identified variants, the convolutionalneural network comprising: an input layer; a plurality of pooledconvolutional layers connected sequentially after the input layer, eachpooled convolutional layer taking an input, applying at least oneconvolutional operation to the input, and applying a pooling operationto the output of the convolutional operation; and at least twofully-connected layers connected sequentially after the last of thepooled convolutional layers, the at least two fully-connected layerscomprising a second fully-connected layer connected sequentially after afirst fully-connected layer; and an output module to output thepredicted categorical output variables; wherein, for each variant, thecategorical output variables comprise: a first categorical outputvariable comprising an alternate base at a single-nucleotidepolymorphism; a second categorical output variable comprising zygosity;a third categorical output variable comprising type; and a fourthcategorical output variable comprising length of an insertion ordeletion of bases; the first categorical output variable is selectedfrom a group comprising adenine (A), cytosine (C), guanine (G), andthymine (T); the second categorical output variable is selected from agroup comprising Homozygote and Heterozygote; the third categoricaloutput variable is selected from a group comprising Reference, SNP,Insertion, and Deletion; the fourth categorical output variable isselected from a group comprising 0, 1, 2, 3, 4, and greater than 4; thefirst categorical output variable is output from the firstfully-connected layer, and the second categorical output variable, thethird categorical output variable, and the fourth categorical outputvariable are outputted from the second fully-connected layer; and theplurality of pooled convolutional layers comprises three pooledconvolutional layers.
 8. The system of claim 7, wherein the firstfully-connected layer comprises a regression analysis using a sigmoidactivation function, and the second fully-connected layer comprises asoftmax classification analysis.
 9. The system of claim 7, wherein theplurality of pooled convolutional layers comprises exactly three pooledconvolutional layers.